Codebook Based Handwritten and Printed Arabic Text Zone Classification
Abstract
In this work, we present a method for classifying handwritten and printed Arabic text zones in noisy document images. We use Three-Adjacent-Segment (TAS) [8] based features, which capture properties of a script. We construct two codebooks of local shape features extracted from a set of handwritten and printed Arabic documents and use them to train both Support Vector Machine and Fisher's linear discriminant classifiers using normalized histograms. Because TAS features are robust to noise, the proposed classification scheme is suitable for noisy document images, where the performance of other methods degrades drastically. Our experiments show that we can achieve 90–95% classification accuracy. The method is also robust to segmentation results, which may contain segments at the word, line, or paragraph level.
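The codebook and normalized-histogram step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each text zone yields a list of fixed-length local shape descriptors (plain tuples stand in for real TAS features), that the two codebooks are already available (here they are tiny hand-written lists), and that the histograms over the two codebooks are concatenated into one feature vector. All function and variable names are hypothetical.

```python
import math


def nearest_codeword(descriptor, codebook):
    """Return the index of the codeword closest to the descriptor (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(descriptor, codebook[i]))


def normalized_histogram(descriptors, codebook):
    """Count nearest-codeword assignments and scale the counts so they sum to 1."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    if total == 0:
        # A zone with no descriptors gets an all-zero histogram.
        return [0.0] * len(codebook)
    return [c / total for c in counts]


def zone_feature(descriptors, handwritten_codebook, printed_codebook):
    """Concatenate the histograms over both codebooks (an assumption of this sketch)."""
    return (normalized_histogram(descriptors, handwritten_codebook)
            + normalized_histogram(descriptors, printed_codebook))


# Hypothetical toy data: 2-D placeholder descriptors, not real TAS features.
handwritten_codebook = [(0.0, 0.0), (1.0, 1.0)]
printed_codebook = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
zone_descriptors = [(0.1, 0.1), (0.9, 0.8), (0.2, 0.0), (1.0, 0.9)]

feature = zone_feature(zone_descriptors, handwritten_codebook, printed_codebook)
print(feature)
```

A feature vector of this kind, one per zone, is what a Support Vector Machine or Fisher's linear discriminant classifier would then be trained on. The classifier itself is left out of the sketch.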
Similar resources
Shape codebook based handwritten and machine printed text zone extraction
In this paper, we present a novel method for extracting handwritten and printed text zones from noisy document images with mixed content. We use Triple-Adjacent-Segment (TAS) based features which encode local shape characteristics of text in a consistent manner. We first construct two codebooks of the shape features extracted from a set of handwritten and printed text documents respectively. We...
Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model
To facilitate the entry of data into the computer and its digitization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...
Language identification for handwritten document images using a shape codebook
Language identification for handwritten document images is an open document analysis problem. In this paper, we propose a novel approach to language identification for documents containing a mixture of handwritten and machine printed text using image descriptors constructed from a codebook of shape features. We encode local text structures using scale and rotation invariant codewords, each repres...
Zone Based Features for Handwritten and Printed Mixed Kannada Digits Recognition
In the field of Optical Character Recognition (OCR), zoning is used to extract topological information from patterns. In this paper we propose zone based features for recognition of a mixture of handwritten and printed Kannada digits. A digit image is divided into 64 zones and pixel density is computed for each zone. This procedure is sequentially repeated for entire zone. Finally 64 features a...
Document Analysis And Classification Based On Passing Window
In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...